August 23rd, 2021

Cost Optimized Metrics through Log Analytics

Abinav Rameesh
Program Manager, Azure Cosmos DB

Azure Cosmos DB is designed for high throughput and low latency workloads, with the ability to serve millions of requests per second and billions of requests a day. In addition to seeking a scalable database that can serve massive volumes of traffic, users also look for granular server-side metrics for monitoring the health and performance of their mission-critical workloads built on Azure Cosmos DB.

Azure Cosmos DB can be integrated as an opt-in feature with Azure Log Analyticsan Azure service that provides querying, monitoring, and alerting capabilities for Azure metrics.

However, with billions of requests to the database, coupled with the need to retain this data for application performance analysis, the data footprint of these metrics can start to add up. While Log Analytics only charges for storage, not for the compute required for querying and retrieving metrics, the cost of ingestion and retention of these metrics can get large for a high throughput database.

 

Savings with the new pipeline

Azure Cosmos DB recently launched improvements to significantly reduce the cost of the Log Analytics integration, by migrating to a new metrics pipeline called Resource Specific tables. Previously, all Azure services pushed metrics to a common table in Log Analytics, which made the format of these tables rigid. Furthermore, additional data was needed to distinguish metrics for each service in the common platform, leading to a higher footprint for metrics.

By moving to the new pipeline, Azure Cosmos DB metrics are now available in “Resource Specific” tables with a schema of their own. This has significantly lowered the data footprint and by extension, lowered the cost of ingestion and retention of Azure Cosmos DB metrics in Log Analytics.

With a simple toggle in the UI along with minor changes to the queries, metrics can now be retrieved through Log Analytics at a significantly reduced cost. Tabulated below are cost savings for the most frequently used diagnostic settings in Azure Cosmos DB.

Image Cost Optimized Log AnalyticsFigure 1: Expected savings for each Diagnostic Setting capturing server-side metrics for Azure Cosmos DB

 

Additional Improvements

Previously, all metrics were published as strings requiring the casting of numeric values. Additionally, column names for different categories of diagnostics were not uniform. With resource specific tables, naming conventions are consistent and numeric fields no longer need to be cast to be queried effectively.

 

Querying Server-Side Metrics

Detailed below are a flowchart and queries, highlighting performance monitoring with Log Analytics.

Image Metrics Debugging Flowchart

Figure 2 – Flowchart for debugging application latencies by leveraging Azure Cosmos DB metrics in Log Analytics

 

Log Analytics (Resource Specific) Queries for each of the steps above:

  1. Spikes in server latencies?
    CDBDataPlaneRequests
    | where TimeGenerated > now(-6h)
    | where AccountName == "LA-METRICS-DEMO"
    | summarize max(DurationMs) by bin(TimeGenerated, 10m)
    | render timechart
  2. Few or multiple physical partitions?
    CDBDataPlaneRequests
    | where TimeGenerated > now(-6h)
    | where AccountName == "LA-METRICS-DEMO"
    | summarize max(DurationMs) by bin(TimeGenerated, 10m), PartitionId
    | render timechart
  3. Throttled requests in the time window?
    CDBDataPlaneRequests
    | where TimeGenerated > now(-6h)
    | where AccountName == "LA-METRICS-DEMO"
    | where StatusCode == 429
    | summarize count() by bin(TimeGenerated, 10m)
    | render timechart
  4. Larger request volume in time window?
    CDBDataPlaneRequests
    | where TimeGenerated > now(-6h)
    | where AccountName == "LA-METRICS-DEMO"
    | summarize count() by bin(TimeGenerated, 10m)
    | render timechart
  5. Fetch specific requests that spiked
    CDBDataPlaneRequests
    | where TimeGenerated > now(-6h)
    | where AccountName == "LA-METRICS-DEMO"
    | summarize count() by bin(TimeGenerated, 10m), OperationName
    | render timechart
  6. Higher RU/s per operation in time window
    CDBDataPlaneRequests
    | where TimeGenerated > now(-6h)
    | where AccountName == "LA-METRICS-DEMO"
    | summarize max(RequestCharge) by bin(TimeGenerated, 10m), OperationName
    | render timechart
  7. Larger payload size for write operations?
    CDBDataPlaneRequests
    | where TimeGenerated > now(-6h)
    | where AccountName == "LA-METRICS-DEMO"
    | where OperationName in ("Create", "Upsert", "Delete", "Execute")
    | summarize max(RequestLength) by bin(TimeGenerated, 10m), OperationName
    | render timechart
  8. Larger response size for reads?
    CDBDataPlaneRequests
    | where TimeGenerated > now(-6h)
    | where AccountName == "LA-METRICS-DEMO"
    | where OperationName in ("Read", "Query")
    | summarize max(ResponseLength) by bin(TimeGenerated, 10m), OperationName
    | render timechart
  9. Is a common logical partition the culprit?
    CDBPartitionKeyRUConsumption
    | where TimeGenerated > now(-6h)
    | where AccountName == "LA-METRICS-DEMO"
    | summarize sum(RequestCharge) by PartitionKey, PartitionKeyRangeId
    // 9.1 Is a common logical partition consuming more storage than others?
    CDBPartitionKeyStatistics
    | where TimeGenerated > now(-6h)
    | where AccountName == "LA-METRICS-DEMO"
    | summarize StorageConsumed = sum(SizeKb) by PartitionKey
    | order by StorageConsumed desc
  10. Server side timeouts observed?
    CDBDataPlaneRequests
    | where TimeGenerated >= now(-6h)
    | where AccountName == "LA-METRICS-DEMO"
    | where StatusCode == 408
    | summarize count() by bin(TimeGenerated, 10m)
    | render timechart
  11. (&12) Jump in cross region calls? Observed in a few or several client VMs?
    // Confirm the regions against which Client IPs are sending requests to Cosmos DB
    // Setting the region name in the SDK's 'UserAgent' can also be used to check for cross region calls
    CDBDataPlaneRequests
    | where TimeGenerated >= now(-6h)
    | where AccountName == "LA-METRICS-DEMO"
    | summarize count() by ClientIpAddress, RegionName

 

Next Steps

To learn more about the new resource-specific tables in Azure Log Analytics see, Monitoring Azure Cosmos DB with Resource Specific Tables in Log Analytics.

Also, stay tuned to this blog for an upcoming post on monitoring client-side performance through diagnostic metrics in the Azure Cosmos DB Java v4 Client

Author

Abinav Rameesh
Program Manager, Azure Cosmos DB

Abinav is a Program Manager on Azure Cosmos DB - focusing on High Availability, the Cassandra API and MI offerings as well as customer success.

0 comments

Discussion are closed.